Machine-Learning-Based Prediction of Plant Cuticle–Air Partition Coefficients for Organic Pollutants: Revealing Mechanisms from a Molecular Structure Perspective

Accurately predicting plant cuticle–air partition coefficients (Kca) is essential for assessing the ecological risk of organic pollutants and elucidating their partitioning mechanisms. The current work collected 255 measured Kca values from 25 plant species and 106 compounds (dataset (I)) and averaged them to establish a dataset (dataset (II)) containing Kca values for 106 compounds. Machine-learning algorithms (multiple linear regression (MLR), multi-layer perceptron (MLP), k-nearest neighbors (KNN), and gradient-boosting decision tree (GBDT)) were applied to develop eight QSPR models for predicting Kca. The results showed that the developed models had a high goodness of fit, as well as good robustness and predictive performance. The GBDT-2 model (Radj2 = 0.925, QLOO2 = 0.756, QBOOT2 = 0.864, Rext2 = 0.837, Qext2 = 0.811, and CCC = 0.891) is recommended as the best model for predicting Kca due to its superior performance. Moreover, interpreting the GBDT-1 and GBDT-2 models based on the Shapley additive explanations (SHAP) method elucidated how molecular properties, such as molecular size, polarizability, and molecular complexity, affected the capacity of plant cuticles to adsorb organic pollutants in the air. The satisfactory performance of the developed models suggests that they have the potential for extensive applications in guiding the environmental fate of organic pollutants and promoting the progress of eco-friendly and sustainable chemical engineering.


Introduction
Natural and human activities release massive quantities of organic pollutants into the atmosphere [1].These pollutants are intercepted by terrestrial vegetation and transferred to terrestrial ecosystems, where they accumulate through the food chain to higher nutrient levels [2,3].The plant cuticle acts as the main interface for the exchange of organic pollutants between the air phase and the plant.It also serves as the main barrier to the interception of atmospheric pollutants [4].Apart from the uptake and release of organic pollutants, the plant cuticle is also recognized as an accumulation chamber for organic pollutants [5,6].The partitioning of organic compounds between the cuticle and air is of great interest due to airborne organic pollutants' strong affinity to plant cuticles and their potential toxicity [7].
The exchange of organic contaminants between plant cuticles and air is typically evaluated through the plant cuticle-air partition coefficient (K ca ) [8,9].The partition coefficient, K ca , is generally determined by the ratio of the equilibrium concentration of organic contaminants between the isolated cuticle membranes (CMs) or polymer matrix membranes (MX) and the air, indicating the plant's capacity to uptake airborne organic pollutants [10,11].It is important to note that, at present, there is a paucity of experimentally measured K ca values, and the interaction mechanisms of organic pollutants between plant cuticles and air are still obscure [12].To the best of our knowledge, only a few hundred K ca values for organic pollutants between a few plants and air have been measured experimentally [6].
the interaction between plant cuticles and organic pollutants.It is worth mentioning that the models developed in this study can estimate the accumulation of organics at the plant cuticle-air interface accurately even in the absence of experimental data, and they can provide valuable environmental information to guide the risk assessment and regulation of organic pollutants.

Development of the QSPR Model
The MLR algorithm selected the most appropriate descriptors from the pool of descriptors obtained from our initial screening and developed the QSPR models.Although the overall performance of the models improved as the number of descriptors increased (Figure S1), after the number of descriptors reached 5, descriptors with a VIF value greater than 10 were present in both datasets.Thus, the number of optimal descriptor combinations for both datasets was determined to be four after removing as much redundant information as possible for the QSPR models [25].From Figure 1, it can be observed that the descriptors are not excessively correlated with each other.The selected best descriptors and their associated statistical indicators are presented in Table 1.The p-values of these eight descriptors are <0.5, indicating that they are statistically significant.The linear QSPR models developed with the selected descriptors were as follows: MLR-1 (dataset (I)): log K ca = 2.006 VE1_L + 14.945 LLS_02 + 0.094 H_Dz(p) − 1.044 SpMax2_Bh(v) − 13.433 ( 1) nations (SHAP), thereby extracting some tacit or novel knowledge about the interaction between plant cuticles and organic pollutants.It is worth mentioning that the models de veloped in this study can estimate the accumulation of organics at the plant cuticle-ai interface accurately even in the absence of experimental data, and they can provide valu able environmental information to guide the risk assessment and regulation of organi pollutants.

Development of the QSPR Model
The MLR algorithm selected the most appropriate descriptors from the pool of de scriptors obtained from our initial screening and developed the QSPR models.Although the overall performance of the models improved as the number of descriptors increased (Figure S1), after the number of descriptors reached 5, descriptors with a VIF value greate than 10 were present in both datasets.Thus, the number of optimal descriptor combina tions for both datasets was determined to be four after removing as much redundant in formation as possible for the QSPR models [25].From Figure 1, it can be observed that the descriptors are not excessively correlated with each other.The selected best descriptor and their associated statistical indicators are presented in       Three nonlinear algorithms, namely MLP, KNN, and GBDT, were employed to explore the nonlinear relationship between the K ca of organic pollutants and their molecular structures, with a view to developing more accurate models.For the same dataset, the nonlinear models were trained using the same molecular descriptors as the linear models.Eventually, six nonlinear QSPR models were developed based on two datasets and three machine-learning algorithms.The results of the hyperparameter search for the models are presented in Table S1.The measured and predicted log K ca values for the MLR-1 model, MLP-1 model, KNN-1 model, and GBDT-1 model are provided in Table S2, while the values associated with the models developed based on dataset (II) are presented in Table S3.The most stringent model validation criteria in the field of QSPR research were employed in this study to assess the performance of models, including R 2 adj > 0.7, Q 2 LOO > 0.6, Q 2 BOOT > 0.6, R 2 ext > 0.7, Q 2 ext > 0.6, CCC > 0.85, and minimize error values [26,27].The values of the validation parameters for the eight QSPR models are shown in Table 2.As shown in Table 2, the R 2 adj (0.850-0.995) of the four models (MLR-1, MLP-1, KNN-1, and GBDT-1 model) developed based on dataset (I) exceeded the standard thresholds, demonstrating excellent goodness of fit.The stability parameters Q 2 LOO and Q 2 BOOT had values of 0.678-0.936and 0.676-0.964,respectively, above the acceptable thresholds, indicating that the models are statistically robust and have fair internal accuracy [28].In terms of external predictive power, the R 2 ext (0.790-0.911),Q 2 ext (0.784-0.902), and CCC (0.889-0.952) of these models were much greater than the strict standard thresholds, implying that all four models achieved fairly reasonable predictions [29].The measured and predicted log K ca values for the training and test sets are plotted in Figures 2a and 2b, respectively.For the training and test sets of each model, the data points followed a similar discrete pattern, with all of them being close to the 1:1 line, proving that these four models had high-level accuracy and prediction abilities [30,31].The consistency of the training and test set errors indicated that these models had similar internal and external prediction accuracies, confirming the excellent external prediction abilities of these models [30].Overall, both the linear and nonlinear QSPR models developed based on dataset (I) were acceptable.As shown in Table 2, the R 2 adj (0.850-0.995) of the four models (MLR-1, MLP-1, KN 1, and GBDT-1 model) developed based on dataset (I) exceeded the standard thresho demonstrating excellent goodness of fit.The stability parameters Q 2 LOO and Q 2 BOOT had ues of 0.678-0.936and 0.676-0.964,respectively, above the acceptable thresholds, indi ing that the models are statistically robust and have fair internal accuracy [28].In term external predictive power, the R 2 ext (0.790-0.911),Q 2 ext (0.784-0.902), and CCC (0.889-0.9 of these models were much greater than the strict standard thresholds, implying that four models achieved fairly reasonable predictions [29].The measured and predicted Kca values for the training and test sets are plotted in Figure 2a and Figure 2b, respectiv For the training and test sets of each model, the data points followed a similar disc pattern, with all of them being close to the 1:1 line, proving that these four models h high-level accuracy and prediction abilities [30,31].The consistency of the training a test set errors indicated that these models had similar internal and external predict accuracies, confirming the excellent external prediction abilities of these models [ Overall, both the linear and nonlinear QSPR models developed based on dataset (I) w acceptable.Multiple verification parameters were also calculated to evaluate the performanc the models (MLR-2, MLP-2, KNN-2, and GBDT-2 model) developed based on dataset (Table 2).The internal validation results derived from the use of the data points in training set were R  Multiple verification parameters were also calculated to evaluate the performance of the models (MLR-2, MLP-2, KNN-2, and GBDT-2 model) developed based on dataset (II) (Table 2).The internal validation results derived from the use of the data points in the training set were R 2 adj = 0.891-0.925,Q 2 LOO = 0.661-0.874,and Q 2 BOOT = 0.802-0.886,indicating that the models exhibited excellent internal predictability and stability.The external validation results derived from the use of the data points in the test set were R 2 ext = 0.821-0.887,Q 2 ext = 0.807-0.884,and CCC = 0.891-0.940,demonstrating the superior performance of the models in predicting external data.In addition, the error-based statistical metrics further demonstrated the "good" quality of these models in predicting log K ca values for the training set and test set.Scatter plots of the log K ca values measured and predicted by the four models developed using dataset (II) are shown in Figure 2c,d.The data points were more concentrated on the 1:1 line than those in dataset (I), indicating that the four models also have the appropriate ability to predict log K ca values.The results of our rigorous validation testing indicated that these four models performed satisfactorily in various aspects.

Applicability Domain
The Williams plots shown in Figure 3 and Figure S2 established the structure range of the compounds for which the model could reliably predict log K ca values.Although structural outliers were found in all eight models, most of them fell into the category of "good high leverage" points with low δ (|δ| ≤ 3).These points were predicted with high accuracy, ensuring the stability and generalization performance of the models and extending their applicability domains to some extent [32,33].The MLP-1, KNN-2, and GBDT-2 models contained one, two, and two structural outliers with |δ| > 3 as well (Figure S2a,f and Figure 3b), respectively.This may be caused by the unique structure of these three compounds and the limited structural representation of the selected descriptors [34].Information related to the response outliers in datasets (I) and (II) is listed in Tables S4 and S5.Decachlorobiphenyl (ID: 124) and Tetrachlorobiphenyl (ID: 129) in dataset (I) were detected as response outliers in all of the four models developed by the four algorithms.This inaccurate prediction might have been caused by the fact that the molecular descriptors did not capture information about the key effects of plant cuticle adsorption on these two compounds [33].In addition, the variability of the experimental values may also have an impact on the model predictions [35].For example, the K ca value for Hexachlorobenzene (ID: 74) was probably underestimated excessively by Gobas, McNeil, Lovett-Doust and Haffner [36], and, therefore, was not in the AD range of the KNN-1 model and the GBDT-1 model when compared to the value measured by Sabljic, Guesten, Schoenherr and Riederer [5] (log K ca = 4.30 versus 6.78~7.28).By taking the mean value instead, the models developed based on dataset (II) predicted the K ca value for Hexachlorobenzene (ID: 24) with good accuracy.Overall, the presence of most data points in the applicability domain demonstrated the validity and good performance of the models [37].
Q 2 ext = 0.807-0.884,and CCC = 0.891-0.940,demonstrating the superior performance of the models in predicting external data.In addition, the error-based statistical metrics further demonstrated the "good" quality of these models in predicting log Kca values for the training set and test set.Scatter plots of the log Kca values measured and predicted by the four models developed using dataset (II) are shown in Figure 2c,d.The data points were more concentrated on the 1:1 line than those in dataset (I), indicating that the four models also have the appropriate ability to predict log Kca values.The results of our rigorous validation testing indicated that these four models performed satisfactorily in various aspects.

Applicability Domain
The Williams plots shown in Figures 3 and S2 established the structure range of the compounds for which the model could reliably predict log Kca values.Although structural outliers were found in all eight models, most of them fell into the category of "good high leverage" points with low δ (|δ| ≤ 3).These points were predicted with high accuracy, ensuring the stability and generalization performance of the models and extending their applicability domains to some extent [32,33].The MLP-1, KNN-2, and GBDT-2 models contained one, two, and two structural outliers with |δ| > 3 as well (Figure S2a,f and Figure 3b), respectively.This may be caused by the unique structure of these three compounds and the limited structural representation of the selected descriptors [34].Information related to the response outliers in datasets (I) and (II) is listed in Tables S4 and S5.Decachlorobiphenyl (ID: 124) and Tetrachlorobiphenyl (ID: 129) in dataset (I) were detected as response outliers in all of the four models developed by the four algorithms.This inaccurate prediction might have been caused by the fact that the molecular descriptors did not capture information about the key effects of plant cuticle adsorption on these two compounds [33].In addition, the variability of the experimental values may also have an impact on the model predictions [35].For example, the Kca value for Hexachlorobenzene (ID: 74) was probably underestimated excessively by Gobas, McNeil, Lovett-Doust and Haffner [36], and, therefore, was not in the AD range of the KNN-1 model and the GBDT-1 model when compared to the value measured by Sabljic, Guesten, Schoenherr and Riederer [5] (log Kca = 4.30 versus 6.78~7.28).By taking the mean value instead, the models developed based on dataset (II) predicted the Kca value for Hexachlorobenzene (ID: 24) with good accuracy.Overall, the presence of most data points in the applicability domain demonstrated the validity and good performance of the models [37].

Mechanism Interpretation
In addition to assessing the performance of each model, SHAP values were analyzed to interpret both the GBDT-1 model and the GBDT-2 model and determine whether the models were consistent with known mechanisms [38].The SHAP summary plot shown in Figure 4 succinctly and clearly shows the relationship between the SHAP values and each descriptor, making it easy to discern whether there is a positive or negative correlation between the descriptor and log K ca .For example, compounds with a large VE1_L (red color) typically have higher SHAP values, indicating higher log K ca values.Thus, the correlation between VE1_L and log K ca can be considered positive.The descriptors are sorted on the y-axis in descending order by the mean of the absolute value of SHAP, i.e., the descriptor contribution.

Mechanism Interpretation
In addition to assessing the performance of each model, SHAP values were analyzed to interpret both the GBDT-1 model and the GBDT-2 model and determine whether the models were consistent with known mechanisms [38].The SHAP summary plot shown in Figure 4 succinctly and clearly shows the relationship between the SHAP values and each descriptor, making it easy to discern whether there is a positive or negative correlation between the descriptor and log Kca.For example, compounds with a large VE1_L (red color) typically have higher SHAP values, indicating higher log Kca values.Thus, the correlation between VE1_L and log Kca can be considered positive.The descriptors are sorted on the y-axis in descending order by the mean of the absolute value of SHAP, i.e., the descriptor contribution.Figure 4 shows that the contributions of LLS_02 and SpMax2_Bh(v) to log Kca are consistently in the top four positions in both the GBDT-1 model and GBDT-2 model, suggesting that these two descriptors have a dominant effect on the plant cuticle's capacity to adsorb organic pollutants.The descriptor LLS_02 is a lead-like score modified by Monge, Arrault, Marot and Morin-Allory [39], and it was used to screen compounds that qualify as leads in drug discovery [40].The score was determined based on eight rules [39]: compounds that met all the rules had an LLS_02 value equal to 1, and the more rules that were violated, the lower the LLS_02 value [41].From these eight rules, it was found that compounds with lower LLS_02 values had high molecular weights, as well as a high number of hydrogen bond donors and hydrogen bond acceptors.High-molecular-weight compounds are usually difficult to pass through phospholipid membranes, and increasing the number of hydrogen bond donors and hydrogen bond acceptors makes these compounds more hydrophilic [42].Therefore, a decrease in log Kca could be expected for compounds with lower LLS_02 values.The descriptor SpMax2_Bh(v) was largest eigenvalue n. 2 of Burden matrix weighted by van der Waals volume, and belonged to Burden eigenvalues [43].Burden eigenvalues were calculated from the Burden matrix Bh(w), and the diagonal elements of the adjacency matrix are van der Waals volume.These values were related to molecular branching, the presence of heteroatoms, and bond multiplicity [44,45].The van Figure 4 shows that the contributions of LLS_02 and SpMax2_Bh(v) to log K ca are consistently in the top four positions in both the GBDT-1 model and GBDT-2 model, suggesting that these two descriptors have a dominant effect on the plant cuticle's capacity to adsorb organic pollutants.The descriptor LLS_02 is a lead-like score modified by Monge, Arrault, Marot and Morin-Allory [39], and it was used to screen compounds that qualify as leads in drug discovery [40].The score was determined based on eight rules [39]: compounds that met all the rules had an LLS_02 value equal to 1, and the more rules that were violated, the lower the LLS_02 value [41].From these eight rules, it was found that compounds with lower LLS_02 values had high molecular weights, as well as a high number of hydrogen bond donors and hydrogen bond acceptors.High-molecularweight compounds are usually difficult to pass through phospholipid membranes, and increasing the number of hydrogen bond donors and hydrogen bond acceptors makes these compounds more hydrophilic [42].Therefore, a decrease in log K ca could be expected for compounds with lower LLS_02 values.The descriptor SpMax2_Bh(v) was largest eigenvalue n. 2 of Burden matrix weighted by van der Waals volume, and belonged to Burden eigenvalues [43].Burden eigenvalues were calculated from the Burden matrix Bh(w), and the diagonal elements of the adjacency matrix are van der Waals volume.These values were related to molecular branching, the presence of heteroatoms, and bond multiplicity [44,45].The van der Waals volume contributed to the lipophilicity of the molecule, and the increase in the SpMax2_Bh(v) value logically should have led to an increase in the K ca value [46].However, as presented in Figure 4a,b, how the SpMax2_Bh(v) values regulated the partitioning of organic pollutants between plant cuticles and the air was complicated.This might be due to the limited number of data points making it difficult for the models to adequately respond to the mechanisms involved.Similar to LLS_02, LLS_01 is a lead-like score.The score was determined by six rules (molecular weight, hydrogen bond donors, hydrogen bond acceptors, number of rotatable bonds, etc.) [47].Therefore, there was also a positive correlation between LLS_01 and log K ca .
The VE1_L, H_Dz(p), and SpPos_A belonged to the category of 2D matrix-based descriptors.The VE1_L was based on the Laplace matrix, which provides the number of spanning trees for molecular graphing.This quantity reflected the structural complexity of polycyclic molecules, with higher spanning-tree quantities indicating greater molecular structural complexity [48].Generally, compounds with polycyclic structures were more stable and conducive to the adsorption of compounds by the plant cuticle, meaning that they should have shown increased K ca values [29,49].The descriptor H_Dz(p) was based on the Barysz matrix, weighted by the polarizability [50].The Barysz matrix was considered to be related to the presence of heteroatoms and multiple bonds in molecules [48].The magnitude of the polarizability was influenced by molecular size, structure, and electron distribution [51].The greater the polarization rate, the stronger the polarity of the molecules, and the stronger the intermolecular interactions [52].Molecules tended to distribute in the plant cuticle through intermolecular forces, leading to an increase in K ca values [53].The SpPos_A was calculated by summing the positive eigenvalues from the adjacency matrix, encoding information about molecular size, molecular branching, and molecular complexity [54][55][56].Figure 4 indicates that log K ca is proportional to SpPos_A.
According to the results of this study, molecular weight, molecular size, molecular branching, molecular complexity, the number of hydrogen bond donors, the number of hydrogen bond acceptors, the presence of heteroatoms, bond multiplicity, polycyclic structure, and polarizability are the main molecular structure features that affect the capacity of plant cuticles to adsorb airborne organic pollutants.

Model Comparison
To determine which model was most effective at predicting K ca values, cumulative distribution plots of the residuals (Figure S3) were plotted to depict the predictive effectiveness of the eight models.As shown in Figure S3, if the residuals fall in the −1 to 1 interval with a higher percentage, the model predicts the log K ca values more accurately.Our comparison among the four models developed based on dataset (I) indicated that the GBDT-1 model was significantly better than the other three models, while the GBDT-2 model was found to show the best performance among the four models developed based on dataset (II).The superiority of the GBDT-1 and GBDT-2 models in predicting the log K ca values can also clearly be observed in the scatter plots of the real predicted values (Figure 2).Although the GBDT-1 model performed better than the GBDT-2 model in terms of the validation parameters (Table 2), the repetitive data points in dataset (I) exacerbated the risk of data leakage.Therefore, the GBDT-2 model is recommended as a useful tool for predicting the K ca values of organic pollutants.
To further evaluate the eight QSPR models, several existing models for predicting log K ca values were collected and compared.The differences in datasets, descriptors, and validation metrics between the different studies reporting these models increased the difficulty of the comparison.The details of the comparison are presented in Table S6.The number of compounds simulated in the early studies was not more than a hundred; the types of compounds were concentrated, and the distribution behavior was more regular, resulting in a high fitting accuracy for their models [2,10].Eddula, Xu, Jiang, Huang, Tirumala, Liu, Acree and Abraham [6] collected 215 measured K ca values and established pp-LFER models with excellent accuracy.However, the application of these models was limited by the number of descriptors.The existing studies applied only the MLR algorithm to develop models for predicting K ca values and did not fully exploit any powerful nonlinear algorithms.In contrast, the QSPR models developed in this study showed excellent prediction accuracy while fitting more measured K ca values.Further statistical analyses adequately demonstrated the reliability of these models.In addition, the nonlinear relationship between the plant cuticle-air partition coefficients and molecular descriptors of the compounds was established for the first time in this study.

Dataset Preparation
The data points for the experimental log K ca dataset (dataset (I)), consisting of 255 data points measured for 106 compounds and 25 plant species, were collected from existing studies.Specifically, for compounds whose K ca measured values had not been directly reported but their plant cuticle/water partition coefficient (K cw ) and gas/water partition coefficients (K aw ) had been reported, their log K ca value was calculated as follows: log K ca = log K cw − log K aw [10].The plant species and tissue types used for the experimental measurements and the sources corresponding to each data point are listed in Table S7.The difference in the log K ca values measured for each compound in different tissues of different plant species was very small, and the presence of these similar data points significantly increased the risk of data leakage.Therefore, the log K ca values from different plant species and different tissue types were averaged in this study, resulting in dataset (II) containing the measured log K ca values of 106 compounds (Table S8).The mean log K ca values with standard deviation were 6.30 ± 3.14 for dataset (I) and 5.79 ± 3.15 for dataset (II).Following data quality testing (Figure 5), it was concluded that both dataset (I) and dataset (II) meet the Pauta criterion [57].
the number of descriptors.The existing studies applied only the MLR algorithm to velop models for predicting Kca values and did not fully exploit any powerful nonlin algorithms.In contrast, the QSPR models developed in this study showed excellent p diction accuracy while fitting more measured Kca values.Further statistical analyses a quately demonstrated the reliability of these models.In addition, the nonlinear relati ship between the plant cuticle-air partition coefficients and molecular descriptors of compounds was established for the first time in this study.

Dataset Preparation
The data points for the experimental log Kca dataset (dataset (I)), consisting of data points measured for 106 compounds and 25 plant species, were collected from ex ing studies.Specifically, for compounds whose Kca measured values had not been dire reported but their plant cuticle/water partition coefficient (Kcw) and gas/water partit coefficients (Kaw) had been reported, their log Kca value was calculated as follows: log = log Kcw − log Kaw [10].The plant species and tissue types used for the experimental me urements and the sources corresponding to each data point are listed in Table S7.T difference in the log Kca values measured for each compound in different tissues of diff ent plant species was very small, and the presence of these similar data points significan increased the risk of data leakage.Therefore, the log Kca values from different plant spec and different tissue types were averaged in this study, resulting in dataset (II) contain the measured log Kca values of 106 compounds (Table S8).The mean log Kca values w standard deviation were 6.30 ± 3.14 for dataset (I) and 5.79 ± 3.15 for dataset (II).Follow data quality testing (Figure 5), it was concluded that both dataset (I) and dataset (II) m the Pauta criterion [57].

Descriptor Generation and Filtering
The molecular structure characterization information of the compounds was descri through molecular descriptors [58].After determining the molecular structures of the co pounds, the geometric structures of the compounds were fully optimized using MM2 mo ular mechanics [59].A descriptor pool consisting of 5290 molecular descriptors was genera using alvaDesc software (Version 1.0.8)[60].Preliminary descriptor screening was carried in two steps, the first of which involved removing descriptors with missing, constant, a near-constant values, and the second of which involved identifying descriptors with pairw correlations greater than 0.9 and removing one of the interrelated descriptors in order to duce redundant information and make it easier to use the model in the future [61,62].

Descriptor Generation and Filtering
The molecular structure characterization information of the compounds was described through molecular descriptors [58].After determining the molecular structures of the compounds, the geometric structures of the compounds were fully optimized using MM2 molecular mechanics [59].A descriptor pool consisting of 5290 molecular descriptors was generated using alvaDesc software (Version 1.0.8)[60].Preliminary descriptor screening was carried out in two steps, the first of which involved removing descriptors with missing, constant, and near-constant values, and the second of which involved identifying descriptors with pairwise correlations greater than 0.9 and removing one of the interrelated descriptors in order to reduce redundant information and make it easier to use the model in the future [61,62].The remaining 165 and 163 descriptors of dataset (I) and dataset (II), respectively, were further filtered using the MLR algorithm in the next step.Detailed information about the steps of our further screening is presented below.

Model Development and Validation
The Y-ranking method was used to divide each dataset into a training set and a test set [63].The training and test sets consisted of 80% and 20% log K ca measured values and the corresponding molecular descriptors in the dataset, respectively [64,65].The training sets were used to develop and internally validate each model, whereas the test set was solely dedicated to validating each model's performance in predicting the K ca for new compounds [66].One linear algorithm (MLR) and three common nonlinear algorithms-MLP, KNN, and GBDT-were used to develop models for both datasets, resulting in 8 QSPR models.These 8 models were named in the format of "modeling algorithm + dataset number".For example, the QSPR model developed using the KNN algorithm based on dataset (I) was named the KNN-1 model.Further details on these four machine-learning algorithms are provided in Text S1.
Identifying and selecting descriptors that contribute significantly to the dependent variable is essential for QSPR modeling.In this study, further screening of the descriptors was carried out using stepwise MLR in SPSS (Version 20.0) software [67].Generally, the model developed using the best descriptor combination should have a high R 2 adj and Q 2 ext and the lowest possible number of descriptors [68].Additionally, multicollinearity between descriptors was assessed using the variance inflation factor (VIF).The VIF of each descriptor should be less than 10 to avoid excessive inter-correlation between descriptors [69].
The MLR-1 and MLR-2 models were developed by establishing the relationship between measured log K ca values and the best descriptor combinations based on dataset (I) and dataset (II), respectively.The machine-learning library scikit-learn in Python (Version 3.9.6)was utilized to train the nonlinear QSPR models developed by the other three machine-learning algorithms [70].With the help of the GridSearchCV function in the sklearn library, a grid search and fivefold cross-validation were performed to optimize various hyperparameters of models [71].Table S1 lists these hyperparameters and their ranges for each model, as well as the modules for the different machine-learning algorithm implementations.Each modeling process of different ML algorithms took the best combination of descriptors screened by MLR as the independent variable to ensure consistency in the comparison among the models.
In accordance with the fourth principle of the OECD guidelines, assessing the fit, stability, and predictive performance of QSPR models with a wide range of internally and externally validated statistical parameters was essential for understanding the predictive quality of new compounds and ensuring the reliability of the developed models [72].R 2 adj , MAE tra , RMSE tra , and s tra were used to measure the goodness of fit of the models [73,74].Internal robustness was characterized by performing leave-one-out cross-validation and bootstrap cross-validation based on Q 2 LOO and Q 2 BOOT [75][76][77].Each model's external predictive ability was assessed based on R 2 ext , Q 2 ext , CCC, and three error-based metrics [78,79].The leverage value method was employed to limit the compound structure space in which the model could reliably predict log K ca [28].In addition, Williams plots of leverage values (h i ) versus normalized residuals (δ) were applied to visualize the applicability domains of the QSPR models.In addition, response outliers (the point with |δ| > 3) and structural outliers (the point with h > h*) could be clearly identified from the plots [80,81].The formulas for h i , h*, and δ are presented in Text S3.

Model Interpretation
Understanding how dominant features affect model predictions is another important principle in QSPR model development [24].The SHAP method was applied to explain the models developed using the GBDT algorithm and determine the effect of specific structural features of molecules on plant cuticles' adsorption of airborne organic pollutants [82].The SHAP value of a feature is determined by the average of the feature's contribution across all possible feature alignments in the feature set [38].It measures the degree and direction of the descriptor's contribution to the prediction result: higher absolute SHAP values indicate a higher contribution, and whether a SHAP value demonstrates positivity or negativity corresponds to the positive and negative impact of the descriptor on the prediction result [83,84].The global importance of a feature is reflected by averaging the 204, R 2 adj = 0.873, Q 2 LOO = 0.869, Q 2 BOOT = 0.872, RMSE tra = 1.101; n ext = 51, R 2 ext = 0.839, Q 2 ext = 0.835, RMSE ext = 1.296.MLR-2 (dataset (II)): log K ca = 1.325SpPos_A + 12.317 LLS_02 + 7.385 LLS_01 − 1.517 SpMax2_Bh(v) − 15.724 (2) n tra = 84, R 2 adj = 0.891, Q 2 LOO = 0.874, Q 2 BOOT = 0.886, RMSE tra = 0.987; n ext = 22, R 2 ext = 0.833, Q 2 ext = 0.807, RMSE ext = 1.466.

Figure 2 .
Figure 2. Plots of the observed versus predicted for log Kca based on dataset (I).(a) Training set Test set; Plots of the observed versus predicted for log Kca based on dataset (II).(c) Training set Test set.

Figure 2 .
Figure 2. Plots of the observed versus predicted for log K ca based on dataset (I).(a) Training set; (b) Test set; Plots of the observed versus predicted for log K ca based on dataset (II).(c) Training set; (d) Test set.

Figure 3 .
Figure 3. Application domain characterized by Williams plots: the MLR-1 (a) and GBDT-2 (b) models for log K ca .

Figure 4 .
Figure 4. Relationship between SHAP value and the values of different descriptors for dataset (I) (a) and dataset (II) (b).

Figure 4 .
Figure 4. Relationship between SHAP value and the values of different descriptors for dataset (I) (a) and dataset (II) (b).

Figure 5 .
Figure 5. Distribution of experimental data of log Kca values.(a) Distribution of 255 experime log Kca values for 106 compounds; (b) Distribution of average experimental log Kca values for compounds.

Figure 5 .
Figure 5. Distribution of experimental data of log K ca values.(a) Distribution of 255 experimental log K ca values for 106 compounds; (b) Distribution of average experimental log K ca values for 106 compounds.

Table 1 .
Descriptions involved in the final MLR models and corresponding statistical significance (p) and variance inflation factors (VIF) values.